10 research outputs found

    A Continuously Growing Dataset of Sentential Paraphrases

    A major challenge in paraphrase research is the lack of parallel corpora. In this paper, we present a new method to collect large-scale sentential paraphrases from Twitter by linking tweets through shared URLs. The main advantage of our method is its simplicity, as it gets rid of the classifier or human in the loop needed to select data before annotation and subsequent application of paraphrase identification algorithms in previous work. We present the largest human-labeled paraphrase corpus to date of 51,524 sentence pairs and the first cross-domain benchmarking for automatic paraphrase identification. In addition, we show that more than 30,000 new sentential paraphrases can be easily and continuously captured every month at ~70% precision, and demonstrate their utility for downstream NLP tasks through phrasal paraphrase extraction. We make our code and data freely available. Comment: 11 pages, accepted to EMNLP 201

    Peculiarities of the inverted repeats in the complete chloroplast genome of Strobilanthes bantonensis Lindau

    Strobilanthes bantonensis Lindau belongs to the family Acanthaceae. It is an antiviral herb that can be used to prevent influenza virus infections in the border areas between China and Vietnam. Local people call it ‘Purple Ban-lan-gen’ because its root is very similar to that of Strobilanthes cusia (Nees) Kuntze, which is called ‘Southern Ban-lan-gen’ and is listed in the Chinese Pharmacopeia. The two species have been used interchangeably locally. However, their pharmacological equivalence has caused concern for years. We previously sequenced the chloroplast genome of S. cusia. In this study, we sequenced the complete chloroplast genome of S. bantonensis to perform an in-depth comparative genetic analysis of the two Strobilanthes species. The chloroplast genome of S. bantonensis is a circular DNA molecule with a total length of 144,591 bp and encodes 84 protein-coding, 8 ribosomal RNA, and 37 transfer RNA genes. The chloroplast genome has a conservative quadripartite structure, including a large single-copy (LSC) region, a small single-copy (SSC) region, and a pair of inverted repeat (IR) regions, with lengths of 92,068 bp, 17,767 bp, and 17,378 bp, respectively. Phylogenetic analysis confirmed that S. bantonensis is closely related to S. cusia. Compared with other species from Acanthaceae, S. bantonensis has a significantly shortened IR region, suggesting the occurrence of IR contraction events. This study will help future taxonomic, evolutionary, phylogenetic, and bioprospecting studies of the sizeable Strobilanthes genus, which contains over 400 species.

    UNITE: A Unified Benchmark for Text-to-SQL Evaluation

    A practical text-to-SQL system should generalize well on a wide variety of natural language questions, unseen database schemas, and novel SQL query structures. To comprehensively evaluate text-to-SQL systems, we introduce a UNIfied benchmark for Text-to-SQL Evaluation (UNITE). It is composed of publicly available text-to-SQL datasets, containing natural language questions from more than 12 domains, SQL queries from more than 3.9K patterns, and 29K databases. Compared to the widely used Spider benchmark, we introduce ~120K additional examples and a threefold increase in SQL patterns, such as comparative and boolean questions. We conduct a systematic study of six state-of-the-art (SOTA) text-to-SQL parsers on our new benchmark and show that: 1) Codex performs surprisingly well on out-of-domain datasets; 2) specially designed decoding methods (e.g. constrained beam search) can improve performance for both in-domain and out-of-domain settings; 3) explicitly modeling the relationship between questions and schemas further improves the Seq2Seq models. More importantly, our benchmark presents key challenges towards compositional generalization and robustness issues -- which these SOTA models cannot address well. Our code and data processing script are available at https://github.com/awslabs/unified-text2sql-benchmark. Comment: 5 page

    ABIN1 (Q478) is Required to Prevent Hematopoietic Deficiencies through Regulating Type I IFNs Expression

    A20‐binding inhibitor of NF‐κB activation (ABIN1) is a polyubiquitin‐binding protein that regulates cell death and immune responses. Although Abin1 is located on chromosome 5q in the region commonly deleted in patients with 5q minus syndrome, the most distinct of the myelodysplastic syndromes (MDSs), the precise role of ABIN1 in MDSs remains unknown. In this study, mice with a mutation disrupting the polyubiquitin‐binding site (Abin1Q478H/Q478H) are generated. These mice develop MDS‐like diseases characterized by anemia, thrombocytopenia, and megakaryocyte dysplasia. Extramedullary hematopoiesis and bone marrow failure are also observed in Abin1Q478H/Q478H mice. Although Abin1Q478H/Q478H cells are sensitive to RIPK1 kinase–RIPK3–MLKL‐dependent necroptosis, only anemia and splenomegaly are alleviated by RIPK3 deficiency but not by MLKL deficiency or the RIPK1 kinase‐dead mutation. This indicates that the necroptosis‐independent function of RIPK3 is critical for anemia development in Abin1Q478H/Q478H mice. Notably, Abin1Q478H/Q478H mice exhibit higher levels of type I interferon (IFN‐I) expression in bone marrow cells compared to wild‐type mice. Consistently, blocking type I IFN signaling through the co‐deletion of Ifnar1 greatly ameliorated anemia, thrombocytopenia, and splenomegaly in Abin1Q478H/Q478H mice. Together, these results demonstrate that ABIN1(Q478) prevents the development of hematopoietic deficiencies by regulating type I IFN expression.